We were given 30 State of the Nation (SONA) speeches from 1994 to 2018 to analyse. The specific objectives are to:
We collaborated using the following GitHub location: https://github.com/samperumal/dsi-assign2.
We initially split the work as follows and each of us created a folder with our names to push our work to for others to view:
We presented our work to each other and made suggestions for improvement. Before diving into any prediction, we felt it was important to do an Exploratory Data Analysis (EDA) to get a sense of the high level overview of the dataset. This was done by Audrey.
The initial results from the Neural Net gave a 65% accuracy on the validation set. In order to achieve a higher accuracy, we attempted to feed the results of the Topic Modelling and Sentiment Analysis into the Neural Net. Thus, we needed to understand from each other what the outputs of these two methods were and what input the neural net required; getting the data into a usable format took some discussion and a few iterations.
Given the low accuracy of the neural net (NN), we tried a Convolutional Neural Net (CNN). Sam got the initial model working, Merve made improvements, and Vanessa tuned the hyperparameters. The result is discussed in this document and is a collaborative effort.
The CNN did not provide an improvement over the initial NN, and as a result we tried a Recurrent Neural Net (RNN). This takes in a sequence of data and uses the order of the words to make a prediction.
Initially we each performed our own import of the data, splitting out the year and president and tokenising. We realised this duplicated effort, and the different naming conventions made it difficult to collaborate and use each other’s output. In addition, Sam noticed that some of the data was not loaded due to special characters, and that sentences were not being tokenised correctly for various reasons. He therefore became responsible for the data clean-up (preprocessing) and for outputting a .RData file that everyone could use to run our work.
The data as provided consisted of 30 text files, with filenames encoding the president’s name, the year of the speech, and whether it was pre/post an election (absent in non-election years). In working through the files, we discovered that two files were identical, which was corrected at the data source with a replacement file. Additionally, in reading the files, we identified 3 files containing one or more bytes which caused issues with the standard R file IO routines. Specifically, 1 file had a leading Byte-Order-Mark (BOM), typical of files produced on Windows, and 2 other files had invalid unicode characters, which suggests a speech-to-text application was used and experienced either transmission or storage errors. In all cases the offending characters were simply removed from the input files.
Having fixed basic read issues, we then examined the content of each file and the simplistic tokenisation achieved by applying unnest_tokens to the raw lines read in from the files. Several issues were uncovered, and in each case a regular expression was created to correct the issue in the raw read lines:
There are multiple characters which are not handled correctly by the default parser, particularly where unicode characters are substituted for standard ASCII characters or are erroneously inserted as part of the text capture. This range of characters were simply removed from the text: "“”%’‘–+¬>-.
Forward-slashes were converted to a space, to handle both options (hot/cold) and numeric ranges (1998/99).
Bullet-pointed lists are interpreted as a single, exceptionally long sentence by default. We chose to split this up into a lead-in sentence terminated by a colon, and a list of sentences starting after the bullet point character (*).
Numbers (both with and without thousand separator characters) and currency values with leading currency symbols (R/$) were removed.
Specific punctuation (ellipsis, colon, semi-colon) was considered equivalent to a sentence separator and converted to a full stop: :;…¦….
Full stops separated by only whitespace were considered redundant and collapsed to a single full stop.
All contiguous whitespace was collapsed to a single whitespace character.
The unnest_tokens function relies on each new sentence starting with a capital letter. After the above fixes, it was therefore necessary to capitalise every character following a full stop, to ensure it is recognised as the start of a new sentence.
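The clean-up steps above can be sketched as a chain of regular-expression substitutions. This is a minimal illustration only: the patterns approximate the rules described and are not the exact expressions used in the pipeline.

```r
# Illustrative sketch of the text clean-up; patterns approximate the rules
# described above, they are not the exact production expressions.
clean_text <- function(x) {
  x <- gsub("[\"\u201c\u201d%\u2019\u2018\u2013+\u00ac>-]", "", x)  # strip problem characters
  x <- gsub("/", " ", x)                                           # hot/cold, 1998/99
  x <- gsub("[;\u2026\u00a6:]", ".", x)                            # treat as sentence ends
  x <- gsub("R?\\$?[0-9][0-9,]*", "", x)                           # numbers and currency
  x <- gsub("\\.(\\s*\\.)+", ". ", x)                              # collapse repeated stops
  x <- gsub("\\s+", " ", x)                                        # collapse whitespace
  # capitalise the first letter after each full stop
  gsub("(\\. )([a-z])", "\\1\\U\\2", x, perl = TRUE)
}

clean_text("go: now")
```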
Having fixed the text to allow correct sentence tokenisation, and applied the unnest_tokens function, we then determined a unique ID for each sentence by applying a hash digest function to the sentence text. This unique ID allowed everyone to work on the same data with confidence, and enabled us to detect 72 sentences that appeared identically in at least 2 speeches. As these duplicates would potentially bias the analysis and training, all instances of duplicates were removed from the dataset.
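A minimal sketch of the sentence-ID and duplicate-removal step, assuming the digest package and an MD5 hash (the hash function actually used is not stated in this report); the toy sentence table stands in for the real tokenised data.

```r
library(digest)

# Toy sentence table; in the real pipeline these rows come from unnest_tokens
sentences <- data.frame(
  speech   = c("mandela_1994", "mbeki_2004", "zuma_2014"),
  sentence = c("I thank you.", "I thank you.", "We will create jobs."),
  stringsAsFactors = FALSE
)

# Unique ID per sentence: hash of the sentence text
sentences$sent_id <- vapply(sentences$sentence, digest, character(1), algo = "md5")

# Remove ALL instances of any sentence whose ID appears more than once
dup_ids <- names(which(table(sentences$sent_id) > 1))
deduped <- sentences[!sentences$sent_id %in% dup_ids, ]
```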
One final note is that each speech starts with a very similar boiler plate referencing various attendees to the SONA in a single, run-on sentence. We believe this header does not add significantly to the content of the speech, and so we excluded all instances across all speeches.
The figure above shows the change in number of sentences per president after filtering. On the whole there are more sentences per president, with only a single reduction. Additionally, the highest increases are associated with the files where read-errors prevented us from previously reading the entire file. This change is equally evident in the boxplots below, which show the change in distribution per president of words and characters per sentence.
Overall there is a much tighter grouping of sentences, with less variation and more consistent lengths, which is useful for techniques that depend on equal-length inputs, such as some of the Neural Networks. The final histogram below shows the number of sentences per year/president after filtering, which still bears the same basic shape as before filtering, but with a better profile.
For all group work, we separated our full dataset into a random sampling of 80% training and 20% validation data, which was saved into a common .RData file. This ensured that there would be consistency across the data we were working on so that we could use each others work and compare results consistently.
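The shared split can be reproduced in a few lines of base R; the seed, toy data frame and file name below are illustrative stand-ins, not the actual values used by the group.

```r
set.seed(2018)  # illustrative seed; the actual seed is not recorded in this report

# Toy stand-in for the cleaned sentence data
sentences <- data.frame(sent_id   = 1:100,
                        president = sample(c("Mbeki", "Zuma"), 100, replace = TRUE))

# Random 80/20 split into training and validation sets
train_idx <- sample(nrow(sentences), size = floor(0.8 * nrow(sentences)))
train <- sentences[train_idx, ]
valid <- sentences[-train_idx, ]

# Save both to a single .RData file for the whole group to load
save(train, valid, file = file.path(tempdir(), "sona_split.RData"))
```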
The graphs above make it clear that our data is also very unbalanced. In an attempt to correct for this, we applied oversampling with replacement to the training dataset to ensure an equal number of sentences per president. Training was attempted using both balanced and unbalanced training data, but it did not appear to make much difference. Balancing was conducted on the training dataset only, to ensure there are no duplicates in the validation set that might skew the validation error.
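The balancing step can be sketched in base R as oversampling each president’s sentences, with replacement, up to the size of the largest class (toy data shown; the real counts come from the cleaned training set):

```r
set.seed(1)
train <- data.frame(
  president = rep(c("deKlerk", "Mbeki"), times = c(10, 100)),
  sent_id   = 1:110
)

# Size of the largest class determines the target count per president
target <- max(table(train$president))

# Sample each class up to the target, with replacement
balanced <- do.call(rbind, lapply(split(train, train$president), function(d) {
  d[sample(nrow(d), target, replace = TRUE), ]
}))

table(balanced$president)  # both classes now have 100 sentences
```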
Each president made a certain number of SONA speeches, depending on their term in office and whether there was one speech that year or two in an election year (pre- and post-election). Since the data depends on each president’s term in office, it is unbalanced. The sentence counts per president after cleaning the data are:
## [1] "President sentence counts:"
##
## deKlerk Mandela Mbeki Motlanthe Ramaphosa Zuma
## 103 1879 2803 346 240 2697
## [1] "Baseline_accuracies"
##
## deKlerk Mandela Mbeki Motlanthe Ramaphosa Zuma
## 1.276648 23.289539 34.742191 4.288547 2.974715 33.428359
Let us understand the number of words used by each President and how this varies across each SONA speech.
We need to create a metric called “avg_words” which is simply the total number of words across all SONA speeches made by a particular president, divided by the total number of SONA speeches that president made.
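A sketch of the avg_words calculation with dplyr, using toy per-speech counts in place of the real tokenised speeches:

```r
library(dplyr)

# Toy word counts per speech; the real values come from the tokenised data
words_per_speech <- data.frame(
  president = c("Mbeki", "Mbeki", "Zuma"),
  speech_id = c("2004_pre", "2004_post", "2014"),
  num_words = c(3000, 3600, 2600)
)

avg_words_tbl <- words_per_speech %>%
  group_by(president) %>%
  summarise(num_words = sum(num_words), num_speeches = n()) %>%
  mutate(avg_words = num_words / num_speeches) %>%
  arrange(desc(avg_words))
```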
## Joining, by = "president"
| president | num_words | num_speeches | avg_words |
|---|---|---|---|
| Mbeki | 29952 | 9 | 3328 |
| Motlanthe | 3206 | 1 | 3206 |
| Mandela | 16801 | 6 | 2800 |
| Zuma | 23554 | 9 | 2617 |
| Ramaphosa | 2258 | 1 | 2258 |
| deKlerk | 783 | 1 | 783 |
On average, Mbeki used the most words in his SONA speeches, followed by Motlanthe, while de Klerk used the least. Mandela and Zuma are ranked in the middle of their peers. The current president (Ramaphosa) used fewer words than all of his post-1994 peers.
Of the 3 presidents that have made more than 1 SONA speech, Mbeki used more words on average than both Mandela and Zuma and the variance in the number of words used per SONA speech is also higher for Mbeki. In 2004, which was an election year, the average number of words Mbeki used was lower in both his pre- and post-election speeches. Towards the end of his term, his average number of words also dropped off. The data suggests that perhaps Mbeki’s average number of words is correlated with his confidence in being re-elected President.
Lexical diversity refers to the number of unique words used in each SONA.
The number of unique words per SONA ranges from about 700 for de Klerk in 1994 to over 2500 with Mandela in his post election speech of 1999. Mbeki’s post election speech of 2004 and Zuma’s post election speech of 2014 reached close to the 2500 mark.
It’s interesting that whilst the number of unique words used trended upward for Mandela, Mbeki and Zuma both show an upward trend in the lead-up to the election year, followed by a downward trend after elections, despite nearing the 2500 unique words mark in their post election speeches.
If we exclude the post election speeches, the number of unique words used by Mbeki during his term from 2000 to 2008 averages just under 2000 whereas the number of unique words used by Zuma during his term from 2009 to 2017 averages just over 1500.
Lexical density refers to the number of unique words used in each SONA divided by the total number of words; a low value is an indicator of word repetition.
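Both metrics can be sketched from a one-word-per-row token table; the toy frame below is a hypothetical stand-in for the real tokenised speeches.

```r
library(dplyr)

# Toy token table: one word per row, per speech
tokens <- data.frame(
  speech = c(rep("deKlerk_1994", 5), rep("Mandela_1999", 4)),
  word   = c("we", "must", "act", "we", "must",
             "freedom", "peace", "freedom", "work")
)

lexical <- tokens %>%
  group_by(speech) %>%
  summarise(total_words  = n(),
            unique_words = n_distinct(word)) %>%       # lexical diversity
  mutate(lexical_density = unique_words / total_words)  # lexical density
```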
De Klerk repeated over 30% of his words in his 1994 pre election SONA speech. On average, Mandela repeated about 25% of words in each of his SONA speeches and this reduced to about 20% in the post election speech of 1999. Mbeki’s repetition rate was about 23% and this reduced to 20% in the post election speech of 2004. Zuma’s repetition rate is over 30%, with the exception of his post election speech of 2014 at about 23%.
The “bing” lexicon encodes words as either “positive” or “negative”. However, not all words used in the SONA speeches are in the lexicon so we need to adjust for this.
Let’s understand how many “positive” and “negative” words are used by each president across all their SONA speeches and create a metric called “sentiment” which is simply the total number of positive words minus the total number of negative words. We then adjust for the total number of words used from the lexicon in the “sentiment_score” metric.
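A sketch of the sentiment and sentiment_score metrics using the "bing" lexicon; the tokens frame is a toy stand-in for the real one-word-per-row data.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy token table; real data has one row per word per president
tokens <- data.frame(president = "Mandela",
                     word = c("good", "great", "bad", "improve"))

scores <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(president, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment       = positive - negative,
         # normalise by the number of words matched in the lexicon (x100)
         sentiment_score = round(100 * sentiment / (positive + negative), 2))
```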
## Joining, by = "word"
| president | negative | positive | sentiment | sentiment_score |
|---|---|---|---|---|
| Zuma | 788 | 1466 | 678 | 30.08 |
| Ramaphosa | 102 | 181 | 79 | 27.92 |
| Mbeki | 1314 | 2287 | 973 | 27.02 |
| Motlanthe | 180 | 263 | 83 | 18.74 |
| Mandela | 1036 | 1434 | 398 | 16.11 |
| deKlerk | 64 | 58 | -6 | -4.92 |
Of the 3 presidents that have made more than 1 SONA speech, Zuma has the highest sentiment score, followed by Mbeki and then Mandela. Zuma’s sentiment score is nearly double Mandela’s. It’s interesting that the current President, Ramaphosa, has the second highest sentiment score, not far behind Zuma and only slightly ahead of Mbeki.
## Joining, by = "word"
De Klerk’s most used words were “freedom”, “peaceful” and “support” and at least 2 of these 3 come up in all the president’s most used words. Mandela’s most used words include “progress”, “improve”, “reconciliation” and “commitment” which are all words indicating repair and a move towards something better. Mbeki uses many of the same words but also introduces “empowerment” which is a word carried through by Zuma and Ramaphosa, and “success” which is carried through by Zuma. This is likely due to the fact that Black Economic Empowerment (BEE) was introduced under Mbeki and was a policy carried through by Zuma and Ramaphosa. In addition, these words suggest progress in the move towards repair or something better, first spoken about by Mandela. Ramaphosa also introduces the words “confidence”, “effectively”, “enhance” and “efficient”, which are words commonly seen in a business context and have not shown up in any other SA president’s top 10 most frequently used words in a SONA since 1994.
## Joining, by = "word"
## Joining, by = "president"
Common positive words across post 1994 presidents include: “freedom”, “regard”, “support”, “improve” and “progress”. Words introduced by Mandela and unique to his speeches are: “restructuring”, “reconciliation”, “commitment”, “contribution” and “succeed”. Mbeki introduces the words “empowerment”, “comprehensive”, “integrated” and “improving” into the top words used and this is unique to his speeches. Zuma uses the words “success”, “reform” and “pleased” frequently and other presidents do not. Ramaphosa introduces the words “significant”, “productive”, “confidence” and “effectively” which have not yet been seen in any other SA president’s top 10 most frequently used words in a SONA since 1994.
## Joining, by = "word"
Common negative words pre 1994 include: “concerns”/“concern”/“concerned”, “unconstitutional”, “illusion”, “hopeless”, “disagree”, “deprive”, “conflict”, and “boycott”.
Common negative words post 1994 include: “corruption”, “crime”/“criminal”, “poverty”/“poor”, “inequality”, “issue”/“issues” and “crisis”.
A negative word introduced by and unique to Mandela’s top 10 is “struggle”. Mbeki is the only president with the word “racism” in his top 10 negative words. Motlanthe has “conflict” in his top 10 which no other president does. Zuma has “rail” which likely refers to the railway system and does have negative connotations for South Africa. Both Zuma and Ramaphosa use the word “difficult” a lot. Ramaphosa introduces the word “expropriation” into the top 10 for the first time amongst his peers.
## Joining, by = "word"
## Joining, by = "president"
The interpretation is much the same as before. Note the clear separation between the top 10 negative words used pre and post 1994 elections, indicative of the pre and post apartheid regimes.
## Joining, by = "word"
## Joining, by = "year"
The 2 vertical black lines are drawn at 60% and 70% positivity rates. In the majority of years, SONA speeches fall within this range of positivity; however, there are a few more negative speeches in earlier years and a few more positive speeches in later years.
The trend appears to be more positive and less negative over time but how can we be sure?
We will test whether negative sentiment is increasing or decreasing, then whether positive sentiment is increasing or decreasing, using a binomial model because the frequencies are between 0 and 1. Finally, we will test whether average sentiment is increasing or decreasing using a linear model.
##
## Call:
## glm(formula = freq ~ as.numeric(year), family = "binomial", data = subset(sentiments_relative,
## sentiment == "negative"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.21779 -0.05759 0.01224 0.07113 0.19692
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 25.88105 115.18655 0.225 0.822
## as.numeric(year) -0.01316 0.05743 -0.229 0.819
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 0.39087 on 24 degrees of freedom
## Residual deviance: 0.33826 on 23 degrees of freedom
## AIC: 27.484
##
## Number of Fisher Scoring iterations: 3
The slope is negative but the beta of the year variable is not significant so we cannot conclude that negative sentiment is decreasing over time.
##
## Call:
## glm(formula = freq ~ as.numeric(year), family = "binomial", data = subset(sentiments_relative,
## sentiment == "positive"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.19692 -0.07113 -0.01224 0.05759 0.21779
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.88105 115.18655 -0.225 0.822
## as.numeric(year) 0.01316 0.05743 0.229 0.819
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 0.39087 on 24 degrees of freedom
## Residual deviance: 0.33826 on 23 degrees of freedom
## AIC: 27.484
##
## Number of Fisher Scoring iterations: 3
The slope is positive but the beta of the year variable is not significant so we cannot conclude that positive sentiment is increasing over time.
## Joining, by = "word"
##
## Call:
## glm(formula = avg_sentiment ~ as.numeric(year), family = "gaussian",
## data = sentiments_per_year)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -21.252 -7.877 -1.282 8.584 21.444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1369.9476 652.4104 -2.100 0.0460 *
## as.numeric(year) 0.6952 0.3253 2.137 0.0425 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 153.4216)
##
## Null deviance: 4536.4 on 26 degrees of freedom
## Residual deviance: 3835.5 on 25 degrees of freedom
## AIC: 216.44
##
## Number of Fisher Scoring iterations: 2
The slope is positive and the coefficient on the year variable is significant at the 5% level, so we can conclude that average sentiment is increasing over time.
But we need to be cautious with this interpretation: the “bing” lexicon has more than double the number of negative words compared to positive words, so this could be influencing the results, and SONA speeches may in fact be more positive than they appear to be.
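This imbalance can be checked directly from the lexicon itself; a one-line sketch (exact counts may vary slightly between lexicon versions):

```r
library(tidytext)

# Count bing lexicon entries by sentiment class
table(get_sentiments("bing")$sentiment)
```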
##
## negative positive
## 4782 2006
## Joining, by = "word"
Apart from the last 2 presidents, Ramaphosa and Zuma, the presidents are in time order. We can see that other than Motlanthe, the trend is an increasing average sentiment over time but at a decreasing rate. The interquartile range of Mbeki is smaller than Zuma’s which is smaller than Mandela’s.
## Joining, by = "word"
Average sentiment is the proportion of positive words out of all the words in the “bing” lexicon. Mandela shows a very erratic average sentiment, ranging from 0 to over 25. Mbeki and Zuma’s average sentiment mostly ranges between 25 and 50, with the exception of a few such as 2000, 2008, 2012, 2017.
The “afinn” lexicon measures positivity on a scale from -5 (most negative) to +5 (most positive).
## Joining, by = "word"
| score | n | weighted_score |
|---|---|---|
| -5 | 2 | -10 |
| -4 | 27 | -108 |
| -3 | 689 | -2067 |
| -2 | 1059 | -2118 |
| -1 | 867 | -867 |
| 1 | 2422 | 2422 |
| 2 | 3525 | 7050 |
| 3 | 379 | 1137 |
| 4 | 42 | 168 |
| 5 | 32 | 160 |
Most words are scored positive 2, followed by positive 1. This becomes even more pronounced when scores are multiplied by counts to get weighted scores. The distribution of all “afinn” words is as follows:
##
## -5 -4 -3 -2 -1 0 1 2 3 4 5
## 16 43 264 965 309 1 208 448 172 45 5
Words with a score of -2 dominate the lexicon, followed by words with a score of 2. We found a relatively high number of words scored 2 in this analysis, and since this is unlikely to be purely a result of their prevalence in the lexicon, we can conclude that it is probably an accurate reflection of the sentiment that prevails in the text.
## Joining, by = "word"
The interpretation is much the same as with the “bing” lexicon in that the trend is an increasing average sentiment over time however Zuma’s median sentiment is lower than the general trend.
Mandela and Zuma show a wave-like pattern of sentiment. Mbeki shows an increasing and then decreasing pattern.
The “nrc” lexicon associates certain words with emotions.
## Joining, by = "word"
## Joining, by = "president"
The distribution of all “nrc” words is given by:
##
## anger anticipation disgust fear joy
## 1247 839 1058 1476 689
## negative positive sadness surprise trust
## 3324 2312 1191 534 1231
Words can be assigned more than one sentiment. Given the relatively low counts of “anticipation”, “joy” and “surprise” words in the lexicon, we would not expect many words under these headings; “anticipation” therefore has a surprisingly high relative count across all presidents.
Given that “positive” sentiment is the most frequent classification in the “nrc” lexicon, it is not surprising that it comes out as the most frequently assigned classification across all presidents. The distributions across the various sentiments are very similar for all presidents so this lexicon does not provide any insights about specific presidents.
## Joining, by = "word"
The most used negative words which are also associated with the “anger”, “disgust”, “fear” and “sadness” emotions are: “violence”, “struggle” and “poverty”.
The most used positive words which are also associated with the “anticipation”, “joy” and “surprise” emotions are: “youth”, “public” and “progress”.
The most used words that evoke the “trust” emotion are: “system”, “president”, “parliament” and “nation”.
It just so happens that the 4 negation words are also stop words, so they have already been removed from the bigrams and need to be added back. This can be shown as follows:
## [1] "not" "not" "not" "no" "no" "no" "never"
## [8] "never" "without" "without"
Let’s redo the bigrams without removing stop words and see how many bigrams contain 1 of the negation words:
## # A tibble: 1 x 1
## n
## <int>
## 1 118
There are only 118 bigrams that contain negation words. Let’s look at a few examples:
## Selecting by id
| year | word1 | word2 | sentiment1 | sentiment2 | president |
|---|---|---|---|---|---|
| 1994 | no | illusions | neutral | negative | deKlerk |
| 1994 | no | doubts | neutral | negative | deKlerk |
| 1994 | no | right | neutral | positive | deKlerk |
| 1994 | no | illusion | neutral | negative | deKlerk |
| 1994 | no | illusions | neutral | negative | deKlerk |
| 1995 | not | succeed | neutral | positive | Mandela |
| 1997 | not | falter | neutral | negative | Mandela |
| 1997 | not | shirk | neutral | negative | Mandela |
| 1998 | no | magic | neutral | positive | Mandela |
| 1999 | without | regard | neutral | positive | Mandela |
| 2001 | no | benefit | neutral | positive | Mbeki |
| 2004 | without | undue | neutral | negative | Mbeki |
| 2006 | not | wrong | neutral | negative | Mbeki |
| 2006 | not | dead | neutral | negative | Mbeki |
| 2008 | not | disappoint | neutral | negative | Mbeki |
| 2009 | not | lose | neutral | negative | Motlanthe |
| 2009 | not | detract | neutral | negative | Motlanthe |
| 2009 | without | undue | neutral | negative | Motlanthe |
| 2009 | not | suffer | neutral | negative | Motlanthe |
| 2009 | not | underestimate | neutral | negative | Motlanthe |
| 2018 | no | liberation | neutral | positive | Ramaphosa |
| 2018 | not | displace | neutral | negative | Ramaphosa |
| 2009 | not | falter | neutral | negative | Zuma |
| 2009 | not | backward | neutral | negative | Zuma |
| 2014 | not | sufficiently | neutral | positive | Zuma |
| 2014 | not | well | neutral | positive | Zuma |
| 2017 | not | worked | neutral | positive | Zuma |
Let’s see how many there are per president:
## Joining, by = "president"
| president | n | total | perc |
|---|---|---|---|
| deKlerk | 5 | 237 | 2.11 |
| Mandela | 39 | 4700 | 0.83 |
| Mbeki | 34 | 9613 | 0.35 |
| Motlanthe | 7 | 1067 | 0.66 |
| Ramaphosa | 2 | 742 | 0.27 |
| Zuma | 31 | 9296 | 0.33 |
Given that there is such a low percentage of bigrams with negation words, it is not expected to significantly change the interpretation above and recoding the sentiments is not justified.
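The negation-bigram check above can be sketched as follows (toy sentences shown; the real pipeline re-tokenised all speeches into bigrams without stop-word removal):

```r
library(dplyr)
library(tidyr)
library(tidytext)

negations <- c("not", "no", "never", "without")

# Toy sentences standing in for the full speech data
toy <- data.frame(
  president = c("Mandela", "Mbeki"),
  sentence  = c("We will not falter.", "There is no magic solution.")
)

neg_bigrams <- toy %>%
  unnest_tokens(bigram, sentence, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% negations)   # keep bigrams whose first word is a negation
```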
An effective topic model can summarise the ideas and concepts within a document, and this can be used in various ways: a user can identify the main themes within a corpus of documents and draw conclusions from analysing those topics, or use the topics as a form of dimensionality reduction and feed them into different supervised or unsupervised algorithms.
In this project, our group has used topic modelling to better understand the common topics that come up across the SONA speeches, how these relate to the different presidents and speeches, and how they change over time. In addition, the probability that a sentence belongs to a certain topic was used in an attempt to classify which sentence was said by which president (see Section XX).
The data used in this section is the cleaned and processed data described in Section X. The resulting sentence data has been used and dissected further without regard to the train/validation split, unless otherwise stated.
The following methodology was followed:
Figure: Most popular terms
After tokenisation and removal of stop words, the top 20 most used terms across all of the SONA speeches are displayed. Unsurprisingly, “South Africa” is the most used term, followed closely by “South African”, “South Africans” and “Local Government”. These terms do not add to our understanding of the topics and tend to confuse the topic modelling going forward, so removing them allows for a cleaner interpretation. “Public service” is then the most used term.
A pre-requisite of topic modelling is knowing the number of topics that each corpus may contain (i.e. the latent factor k). In some cases this may be a fair assumption, but without reading through each speech, how could one know how many different topics have been articulated in the SONAs? Luckily, Murzintcev Nikita has published an R package (ldatuning) that helps to optimise the number of topics (k) over three different measures. The measures used to determine the number of topics are discussed in an RPubs paper which can be found here: link, and the following optimisation largely follows the accompanying vignette: [link](https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html)
The following extract from the RPub paper gives a brief explanation of the methods used to optimise for k:
Extract from RPubs
"Arun2010: The measure is computed in terms of symmetric KL-Divergence of salient distributions that are derived from these matrix factor and is observed that the divergence values are higher for non-optimal number of topics (maximize)
CaoJuan2009: method of adaptively selecting the best LDA model based on density.(minimize)
Griffiths: To evaluate the consequences of changing the number of topics T, used the Gibbs sampling algorithm to obtain samples from the posterior distribution over z at several choices of T (minimize)"
In addition to this, Nikita considers how the choice of k may change over a validation or hold-out sample. His term for this is “perplexity”, which he defines as “[it] measures the log-likelihood of a held-out test set; Perplexity is a measurement of how well a probability distribution or probability model predicts a sample”.
Below is an attempt to optimise for k and to check that the choice of k holds over an unseen data set.
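A sketch of the k optimisation with ldatuning is shown below. The AssociatedPress corpus shipped with topicmodels stands in for the SONA document-term matrix so the example runs standalone; the topic grid and seed are illustrative.

```r
library(ldatuning)
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:50, ]   # small stand-in for the SONA bigram DTM

result <- FindTopicsNumber(
  dtm,
  topics  = seq(2, 14, by = 2),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
  method  = "Gibbs",
  control = list(seed = 2018)
)

FindTopicsNumber_plot(result)
```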
Figure: Optimisation Metrics
From the above plot, the marginal benefit from adding another topic stops at around 8-10 topics. In order to test this, the “perplexity” over a test sample of the document term matrix can be checked.
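The perplexity check can be sketched with topicmodels::perplexity, again using AssociatedPress as a stand-in corpus and an illustrative set of k values:

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
dtm       <- AssociatedPress[1:100, ]
train_dtm <- dtm[1:80, ]
test_dtm  <- dtm[81:100, ]

# Fit LDA for a range of k and compare training vs held-out perplexity
perp <- sapply(c(2, 4, 8, 12), function(k) {
  fit <- LDA(train_dtm, k = k, control = list(seed = 2018))
  c(train = perplexity(fit, train_dtm),
    test  = perplexity(fit, test_dtm))
})
```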
Figure: Perplexity Plot
As more topics are used, the perplexity of the training sample does decrease, but that of the test sample increases from around 11 topics. The perplexity of the test sample seems to be minimised at around 8 topics.
The evidence from these two plots suggests that the optimal number of topics is around 8.
For this assignment, Latent Dirichlet Allocation (LDA) was used for the topic modelling. Other methods, such as Latent Semantic Analysis (LSA) or Probabilistic Latent Semantic Analysis (pLSA), could have been used, but LDA is useful because it allows: 1. each document within the corpus to be a mixture of topics; 2. each topic to be a mixture of bigrams; 3. the topics to be drawn from a Dirichlet distribution (i.e. not k different distributions as with pLSA), so there are fewer parameters to estimate and no need to estimate the probability that the corpus generates a specific document.
The beta matrix produced gives the probability of each topic producing each bigram (i.e. that the phrase refers to that topic). From this measure, one can get a sense of the character of the topic, and by using the most popular phrases in each topic, an understanding of its flavour emerges. However, it must be kept in mind that terms can belong to more than one topic, so any logic applied to derive a theme or flavour should be interpreted loosely.
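Extracting the beta (topic-term) matrix and the most probable terms per topic can be sketched with tidytext; k = 5 matches the five topics discussed below, and AssociatedPress again stands in for the SONA bigram DTM:

```r
library(topicmodels)
library(tidytext)
library(dplyr)

data("AssociatedPress", package = "topicmodels")
lda_fit <- LDA(AssociatedPress[1:100, ], k = 5, control = list(seed = 2018))

# beta: per-topic probability of generating each term
top_terms <- tidy(lda_fit, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))
```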
From the display of popular terms, it can be determined that topic 1 has a vague connection to “job creation”. This is the most common theme and is supported by other terms that have a high probability of being generated by this topic, such as: + “world cup” + “national youth” + “infrastructure development”
These concepts all support the idea of job creation, as each of these will generate jobs for the country. But there is some noise in the topic from “address terms”, i.e. “honourable speaker” or “honourable chairperson”. “Nelson Mandela” and “President Mandela” crop up too, which suggests that alongside the job creation theme there exists some of what can be termed “terms of endearment”.
As with the previous topic, there are some random “terms of endearment” in this topic as well (i.e. “madam speaker”), but they are not as evident as in the first topic. This is to be expected, as bigrams can be generated by more than one topic since each topic is a mixture of bigrams! The next four terms sum up the main themes for this topic: + “Economic Empowerment” + “Black Economic” + “Justice System” + “Criminal Justice”
In summary, this topic can be summed up as “Economy/Criminal and Justice System”.
Despite the most popular terms being “United Nation” and “private sector”, there is a recurring theme of “development”: development plan, resource development, national development, development programme, etc. And thus the topic is named.
Once again, there is a “term of endearment” among the popular terms (“fellow south”, assumed short for “fellow South Africans”, one of former President Zuma’s favourite phrases). With all the other terms combined, a theme of “Social Reform/Regional and Municipal Government” takes shape.
Given that there is a possible trigram evident here, it may be worth exploring in future work.
“Public sector” and “private sector” are popular terms in topic 5. After consideration of the various other terms, of which some have cross over with other topics and discussion, the eventual name for this topic became “Public Sector Entities”
A different way of looking at this is to investigate the biggest differential in terms between topics. For instance, using the log (base 2) ratio between topic 1 and topic 5 shows the terms that have the widest margin between the two topics (i.e. are far more likely to be in topic 5 versus topic 1).
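The topic 1 vs topic 5 comparison can be sketched as a log2 ratio of the beta probabilities (AssociatedPress and the seed are illustrative stand-ins for the SONA data):

```r
library(topicmodels)
library(tidytext)
library(dplyr)
library(tidyr)

data("AssociatedPress", package = "topicmodels")
lda_fit <- LDA(AssociatedPress[1:100, ], k = 5, control = list(seed = 2018))

log_ratio <- tidy(lda_fit, matrix = "beta") %>%
  filter(topic %in% c(1, 5)) %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  # positive values: far more likely under topic 5; negative: under topic 1
  mutate(log_ratio = log2(topic5 / topic1)) %>%
  arrange(desc(log_ratio))
```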
Figure: WordCloud for Topic 5
For instance, “social programmes”, “human fulfilment” and “rights commission” are all generated in significantly larger proportions by Topic 5 compared to Topic 1, while “national social”, “training colleagues” and “sector unions” all sit within Topic 1.
Given the naming of Topic 5 as “Public Sector Entities” and Topic 1 as “Job Creation/Terms of Endearment” these terms do seem to be grouped in line with expectation.
The LDA model allows each sentence to be represented as a mixture of topics. The gamma matrix shows the document-topic probability for each sentence, i.e. the probability that each sentence is drawn from that topic. For instance, the following sentence, sampled at random, has a 0.905 probability of being drawn from topic 4 based on the bigrams within it. The sentence appears to be talking about water and the infrastructure around it. The label for topic 4 was “Social Reform/Regional and Municipal Government” and this statement seems to be somewhat relevant to it.
| president | year | sentence | X1 | X2 | X3 | X4 | X5 |
|---|---|---|---|---|---|---|---|
| Zuma | 2010 | yet, we still lose a lot of water through leaking pipes and inadequate infrastructure. | 0.023575 | 0.023575 | 0.023575 | 0.9057001 | 0.023575 |
Using this method, the sentences can be roughly classified to a topic based on the probabilities (i.e. classify the sentence by the topic with the highest probability) and further analysis can be conducted.
(Note: which.is.max breaks ties at random, so where a sentence has equal probabilities, it will be assigned to one of the tied topics at random.)
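The classification rule can be sketched as follows. This is an illustrative Python sketch that mimics R's which.is.max behaviour (argmax with random tie-breaking), applied to the gamma row shown in the table above:

```python
import random

# One row of the gamma matrix: P(topic | sentence) for the 5 topics.
gamma_row = [0.023575, 0.023575, 0.023575, 0.9057001, 0.023575]

def assign_topic(probs, rng=random):
    # Mimic R's which.is.max: take the maximum, breaking ties at random.
    best = max(probs)
    candidates = [i for i, p in enumerate(probs) if p == best]
    return rng.choice(candidates) + 1  # 1-based topic index, as in the report

print(assign_topic(gamma_row))  # topic 4 for this sentence
```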
Figure: Topic mixture by president
Consider the mixture of topics that each individual president covers during the SONA address. Despite the imbalance in the number of sentences spoken by each president, there seems to be a fairly standard shape to the topics discussed. The two exceptions to this are de Klerk and Zuma. All other presidents tend to spend around 10-15% on Topic 1 (“Job Creation/Terms of Endearment”), 15-20% on each of Topic 2 (“Economy/Criminal and Justice System”), Topic 3 (“Development”) and Topic 4 (“Social Reform/Regional and Municipal Government”), and around another 10% on Topic 5 (“Public Sector Entities”). This uniformity means that it may be difficult for a supervised model to distinguish presidents based on the topics covered.
As stated, the only two presidents for whom this trend differs are President de Klerk and President Zuma. President de Klerk spent the majority of his time on Topic 1 (“Job Creation/Terms of Endearment”), followed by Topic 2 (“Economy/Criminal and Justice System”). Given the context of the time period, it may be unsurprising that “terms of endearment” and “criminal and justice systems” come up, since his speech would be littered with names of people and political parties, as well as discussion of past injustices.
President Zuma spends the majority of his speeches on Topic 4 (“Social Reform/Regional and Municipal Government”). Once again, given that his term as President was marked by service delivery strikes, two major droughts over a few different regions and discussions around reform, this may be unsurprising. In fact, when the most popular term from topic 4 (“fellow south”) is recalled, it may even be predictable that this would be the most “talked about” topic for President Zuma. What is interesting is that, given the attention to the issues of State Capture that characterised Zuma’s presidency, his coverage of Topic 5 (“Public Sector Entities”) is much smaller than that of his peers.
A similar analysis can be taken over time.
Figure: Topic proportions over time
The graph shows that over time, topics 1 and 5 are the least discussed, while topics 2, 3 and 4 all get much the same airtime. There are a number of notable spikes and valleys:

+ In 1996, Topic 2 (“Economy/Criminal and Justice System”) spikes.

The 1996 SONA was a few months ahead of the introduction of the new constitution, as well as at the start of the Truth and Reconciliation Commission. It could be suggested that these two events drove up this topic in the SONA speech.

+ In 2005, Topic 1 (“Job Creation/Terms of Endearment”) dives while Topic 4 (“Social Reform/Regional and Municipal Government”) and Topic 2 (“Economy/Criminal and Justice System”) spike considerably.

Mbeki’s presidency (1999-2008) was characterised by a rise in crime, specifically farm attacks, as well as the HIV/AIDS epidemic and the start of Black Economic Empowerment in 2005, which could account for the spikes and dips in topics in 2005.

+ In 2012, Topic 2 (“Economy/Criminal and Justice System”) dives considerably.

From various media reports, Zuma’s 2012 SONA speech largely covered the successes of the government while skipping over future plans. This may be a reason why Topic 4 (“Social Reform/Regional and Municipal Government”) rises sharply.
One of the aims behind topic modelling is to reduce the dimensions of the data to allow for other techniques to be applied. In this instance, the aim was to reduce the SONA speeches to a collection of topics that would help predict which president was responsible for a sentence in the SONA speech. The assumption was that each president might have a unique set of topics or mixture of topics that could characterise their particular speech. However, there does not seem to be evidence of this. The matrix with the probability of each sentence belonging to a topic is used in Section X and the results are discussed.
The input is the count of each word used in each sentence. We unnest the sentence data, count each word in each sentence, and spread the counts so that rows are sentence IDs and columns are words. This is the simplest neural net model we can try, so it was our first model.
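The unnest/count/spread pipeline can be sketched as follows. The report's pipeline is in R; this is an illustrative Python sketch of the same idea with made-up sentences:

```python
from collections import Counter

sentences = [
    "we will create jobs and fight crime",
    "crime and corruption undermine jobs",
]

# "Unnest" each sentence into words and count them.
counts = [Counter(s.split()) for s in sentences]

# "Spread": one row per sentence, one column per vocabulary word.
vocab = sorted(set(w for c in counts for w in c))
matrix = [[c.get(w, 0) for w in vocab] for c in counts]

print(vocab)
print(matrix)
```

Each row of `matrix` is the bag-of-words vector for one sentence, which is exactly what is fed into the network.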
Figure: Bag of words model - neural net training
This model has L2 regularization to avoid overfitting, but even so it did not help very much. The accuracy is 0.5589. The optimizer_rmsprop learning rate is 0.003, chosen after trying lr = c(0.001, 0.002, 0.003). For readability, only the model with the best learning rate is shown.
As we can see from the plot, the model overfits after the second epoch, since the validation loss starts increasing. To avoid that, let’s use a smaller model with fewer neurons and add dropout.
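The dropout idea can be sketched as follows. This is an illustrative Python sketch of “inverted” dropout, not the keras implementation: at train time each hidden activation is zeroed with probability p, and the survivors are scaled by 1/(1-p) so the expected activation is unchanged.

```python
import random

def dropout(activations, p, rng):
    # Zero each activation with probability p; scale survivors by 1/(1-p).
    keep = 1.0 - p
    return [a * (1.0 / keep) if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
h = [0.5, 1.2, -0.3, 0.8]
print(dropout(h, p=0.5, rng=rng))
```

At test time dropout is disabled, so no rescaling is needed.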
Figure: Bag of words model - neural net training
The confusion matrix for this model (rows: predicted president, columns: actual president) is:

| Predicted \ Actual | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 180 | 50 | 57 | 14 | 7 | 1 |
| 2 | 33 | 181 | 16 | 8 | 12 | 2 |
| 3 | 56 | 26 | 111 | 10 | 4 | 3 |
| 4 | 7 | 4 | 3 | 3 | 0 | 3 |
| 5 | 4 | 9 | 1 | 0 | 1 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 1 |
Accuracy rate is: 0.5911.
The kappa value tells you how much better your classifier performs than a classifier that simply guesses at random according to the frequency of each class.
“Cohen’s kappa is always less than or equal to 1. Values of 0 or less, indicate that the classifier is useless. There is no standardized way to interpret its values. Landis and Koch (1977) provide a way to characterize values. According to their scheme a value < 0 is indicating no agreement, 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect agreement.” [Reference: Landis, J.R.; Koch, G.G. (1977). “The measurement of observer agreement for categorical data”. Biometrics 33 (1): 159-174]
The kappa value is 0.416, which by the scheme above indicates moderate agreement, so we are doing better than random.
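Cohen's kappa can be computed directly from a confusion matrix. This is an illustrative Python sketch with a toy 2x2 matrix, not the report's results:

```python
# Cohen's kappa from a square confusion matrix (rows: actual, cols: predicted).
def cohens_kappa(cm):
    n = sum(sum(row) for row in cm)
    po = sum(cm[i][i] for i in range(len(cm))) / n  # observed agreement
    # Chance agreement: sum over classes of (row total * column total) / n^2.
    pe = sum(sum(row) * sum(col) for row, col in zip(cm, zip(*cm))) / (n * n)
    return (po - pe) / (1 - pe)

# Toy 2x2 example (not the report's matrix).
cm = [[20, 5],
      [10, 15]]
print(round(cohens_kappa(cm), 3))
```

Here observed agreement is 0.7 and chance agreement 0.5, giving kappa = 0.4.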
The accuracy is slightly better than the bigger model with no dropout (0.581). For a simple word-count model this seems good enough, but it does not consider how important each word is to its corpus, so we should try a better model.
After the fourth iteration validation loss starts increasing which is a sign of overfitting.
TF-IDF is a statistic that shows how important a word is to its corpus. So if we feed the NN with TF-IDF values, we logically expect the results to be slightly better than the word-count NN model.
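One common tf-idf variant can be sketched as follows. This is an illustrative Python sketch with made-up sentences; the exact weighting scheme used by the R pipeline may differ:

```python
import math
from collections import Counter

docs = [
    "honourable members the economy is growing",
    "the economy needs jobs jobs jobs",
]
tf = [Counter(d.split()) for d in docs]
n_docs = len(docs)

def tfidf(word, doc_idx):
    # tf: raw count in the sentence; idf: log(N / number of docs containing it).
    df = sum(1 for c in tf if word in c)
    return tf[doc_idx][word] * math.log(n_docs / df)

print(tfidf("jobs", 1))     # frequent here, absent elsewhere -> high weight
print(tfidf("economy", 1))  # appears in every sentence -> idf = 0, weight 0
```

Words that occur everywhere (“the”, “economy”) get weight zero, while words concentrated in one sentence get large weights, which is the extra signal the word-count model lacks.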
Confusion matrix for the tf-idf model (rows: actual president, columns: predicted president; president 5 is never predicted):

| Actual \ Predicted | 1 | 2 | 3 | 4 | 6 |
|---|---|---|---|---|---|
| 1 | 183 | 46 | 51 | 0 | 0 |
| 2 | 48 | 203 | 18 | 1 | 0 |
| 3 | 52 | 17 | 118 | 1 | 0 |
| 4 | 16 | 8 | 11 | 0 | 0 |
| 5 | 5 | 16 | 3 | 0 | 0 |
| 6 | 0 | 3 | 4 | 2 | 1 |
Figure: TF-IDF model - neural net training
Accuracy rate is: 0.6245.
The accuracy is 0.6158 and the model starts overfitting after the fourth epoch, so this is slightly better than the bag-of-words word-count model, as we expected.
Confusion matrix for the sentiment analysis model (rows: actual president, columns: predicted president; only presidents 1-3 are ever predicted):

| Actual \ Predicted | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 143 | 134 | 3 |
| 2 | 70 | 199 | 1 |
| 3 | 91 | 95 | 2 |
| 4 | 15 | 20 | 0 |
| 5 | 8 | 15 | 1 |
| 6 | 3 | 7 | 0 |
Figure: Sentiment analysis model - neural net training
The accuracy of the sentiment analysis model is 0.4263.
The sentiment analysis model also reaches its smallest validation loss on the fifth epoch, but the train and test accuracies change only very slightly at each iteration. This model does not seem to be doing well on either the training set or the test set: the test accuracy is 0.4362 and the training accuracy is 0.4353. If we look at the NRC sentiment lexicon, it is visible that all presidents share the same sentiment distribution pattern, which is why the model is not overfitting: our held-out test set is effectively no different from the training set.
Topic modelling only predicted president 1 (Mbeki) and president 2 (Zuma).
Figure: Topic modelling model - neural net training
The train and test sets are not very distinct from each other, just as with sentiment analysis. If we look at the mixture of topics by president in the topic modelling section, we can see that the topic distributions for each president are fairly uniform, and it is hard to separate one president’s topics from another’s.
We will be using GloVe embeddings. GloVe stands for “Global Vectors for Word Representation” and, as stated on its website, is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. [Reference: Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation: https://nlp.stanford.edu/pubs/glove.pdf] Specifically, we will use the 100-dimensional GloVe embeddings of 400k words computed on a 2014 dump of English Wikipedia.
The following note is taken from the R reference given below; as it states, the accuracy achieved in Python is twice as good.
“IMPORTANT NOTE: This example does not yet work correctly. The code executes fine and appears to mimic the Python code upon which it is based, however it achieves only half the training accuracy that the Python code does, so there is clearly a subtle difference. We need to investigate this further before formally adding to the list of examples.”
[reference for implementation on Python: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html] [reference for implementation on R: https://keras.rstudio.com/articles/examples/pretrained_word_embeddings.html] [reference for implementation on R: https://github.com/rstudio/keras/blob/master/vignettes/examples/pretrained_word_embeddings.R]
Also, for pre-trained embeddings to work well, they need to have been trained on data similar to the data being classified. Given that the GloVe embeddings are trained on Wikipedia data, one would not necessarily expect them to help predict the presidents for our sentences.
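Building an embedding matrix from GloVe-format vectors can be sketched as follows. This is an illustrative Python sketch with made-up 3-dimensional vectors (the real files are 100-dimensional); `masihambisane` stands in for a non-standard word that is absent from GloVe:

```python
# Hypothetical lines in the GloVe text format: "word v1 v2 ... vd".
glove_lines = [
    "government 0.1 -0.2 0.3",
    "economy 0.0 0.5 -0.1",
]
embeddings = {}
for line in glove_lines:
    word, *values = line.split()
    embeddings[word] = [float(v) for v in values]

# Build the embedding matrix for our own vocabulary; words missing from
# GloVe (e.g. non-standard or foreign words in the SONA data) get zeros.
vocab = {"government": 1, "economy": 2, "masihambisane": 3}
dim = 3
matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]  # row 0 reserved for padding
for word, idx in vocab.items():
    if word in embeddings:
        matrix[idx] = embeddings[word]

print(matrix[3])  # all zeros: the word is not in GloVe
```

The all-zero rows for out-of-vocabulary words are exactly why a corpus with many non-standard words limits what a pre-trained embedding can contribute.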
The majority of Mbeki’s sentences are predicted as Mandela (379/569). The majority of Zuma’s sentences are predicted as Mandela (244/534), with the second largest share predicted as himself (195/534). The majority of Mandela’s sentences are predicted as Mandela (263/381). The majority of Motlanthe’s sentences (38/66) are predicted as Mandela. The majority of Ramaphosa’s sentences are predicted as Mandela (19/47) or Zuma (15/47). The majority of de Klerk’s sentences are predicted as Mandela (14/17).
The bag-of-words model as applied to Neural Networks treats each sentence as an unordered list of integer or one-hot-encoded elements. This captures whether a word occurs in a sentence, and the frequency of occurrence for tf-idf models. While this can be effective, it does ignore any signals in the data related to the ordering and relative positions of words. Sequential neural networks address this problem by treating the data as an ordered list of integers using a dictionary that provides a unique mapping between words and integers. The network then applies various layers to this input that attempt to extract the sequential information for use in later standard layers.
For all our sequential neural network attempts, we converted each sentence to a vector of integers using a word dictionary as our x-data, and one-hot-encoded the presidents as our y-data.
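This encoding can be sketched as follows. This is an illustrative Python sketch with made-up sentences and a hypothetical word dictionary; the real pipeline builds the dictionary over the full SONA vocabulary:

```python
# Map each word to a unique integer, pad to a fixed length, one-hot the labels.
sentences = [["fellow", "south", "africans"], ["we", "will", "create", "jobs"]]
presidents = ["Zuma", "Mandela"]

word_index = {}
for s in sentences:
    for w in s:
        word_index.setdefault(w, len(word_index) + 1)  # 0 is reserved for padding

maxlen = 5
x = [[word_index[w] for w in s][:maxlen] + [0] * (maxlen - len(s)) for s in sentences]

classes = sorted(set(presidents))
y = [[1 if p == c else 0 for c in classes] for p in presidents]

print(x)  # padded integer sequences (the x-data)
print(y)  # one-hot encoded presidents (the y-data)
```

Unlike the bag-of-words matrix, `x` preserves word order, which is what the sequential layers below exploit.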
An embedding layer is a dimensionality reduction technique that attempts to encode the relationships between words in a sentence, input as a variable-length integer array with padding, as a fixed-length floating point vector. An embedding has a tunable hyper-parameter for the number of latent factors to map every sentence onto, where each latent factor attempts to capture a semantic dimension of the sentence as a whole. Embeddings aim to capture the linear substructure of sentences through Euclidean distances between words in the n-dimensional unit hypercube, where n is the number of latent factors specified.
Embeddings can be trained on the corpus of sentences that comprise the dataset under investigation; however, this can prove limiting if there is a relatively small quantity of training data. An alternative approach is to re-use a previously trained embedding layer, such as the GloVe embedding. This has the advantage of leveraging the results from a much larger, and theoretically more generic, dataset in an application of transfer learning. The SONA data includes a large number of non-standard or foreign words, however, which theoretically limits the applicability of pre-trained embeddings.
A convolutional layer applies a moving weighted-average filter (kernel) over the input data that attempts to extract simple patterns for use in later layers. They are an approach to reducing the dimensionality of input data by using a shared weighting across all input nodes, thereby addressing the exploding/vanishing gradient problem that would otherwise occur with a standard fully-connected layer. By way of example, a 100-node input layer followed by a 50-node fully connected layer would have 5000 weights to fit, whereas with an equivalent convolutional layer there would only be 50 weights.
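The weight-count comparison, and the moving weighted-average filter itself, can be sketched as follows. This is an illustrative Python sketch that assumes a single shared kernel with "valid" padding and stride 1, and ignores bias terms:

```python
# Weight counts (ignoring biases) for the example in the text.
input_nodes, dense_nodes, kernel_width = 100, 50, 50

dense_weights = input_nodes * dense_nodes  # every input connects to every unit
conv_weights = kernel_width                # one shared kernel slid over the input
print(dense_weights, conv_weights)         # 5000 vs 50

def conv1d(x, kernel):
    # Slide the shared kernel over the input (valid padding, stride 1).
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) for i in range(len(x) - k + 1)]

print(conv1d([1, 2, 3, 4], [0.5, 0.5]))  # a moving average: [1.5, 2.5, 3.5]
```

The 100x reduction in weights is what makes the shared-kernel approach tractable where a fully-connected layer would be prone to exploding/vanishing gradients.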
The convolutional layer has a number of tunable hyper-parameters:
We experimented with a number of different network topologies, and finally settled on the following:
We also experimented with Deep CNN architectures by adding on additional densely connected layers (with accompanying dropout and activation layers) below the Convolutional layer. Despite much experimentation with this, additional layers did not appear to have any noticeable effect on the accuracy of our results.
A Recurrent Neural Network is an attempt to model the relationship between words in a sentence based on their relative positions. It involves repeatedly applying the same layer to each word of a sentence (rather than the sentence as a whole), in a manner that allows the layer to “remember” aspects of the words already seen. For our application we used a long short-term memory (LSTM) layer, which trains both the weights and the memory of the layer.
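The recurrence can be sketched as follows. This is an illustrative Python sketch of a plain recurrent step with scalar weights, far simpler than the LSTM layer actually used (an LSTM adds gated memory cells on top of this basic recurrence):

```python
import math

def rnn(sequence, w_x=0.5, w_h=0.8):
    # The same weights are applied at every position; the hidden state h
    # "remembers" what has been seen so far in the sentence.
    h = 0.0
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h)
    return h

# The same words in a different order give a different final state, which is
# exactly the order-sensitivity a bag-of-words model lacks.
print(rnn([1.0, 0.0, 2.0]))
print(rnn([2.0, 0.0, 1.0]))
```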
We experimented with a number of different network topologies, and finally settled on the following:
RNNs have been shown to achieve very good performance when applied to Natural Language Processing (NLP) of text, due to the similarities with how humans process language. Unfortunately, our RNN did not surpass the performance of the other networks we attempted, despite many tuning attempts. We suspect the sparsity of the dataset played a large role in this result, as it was replicated on both the balanced and unbalanced training data.
As a conclusion, the train and test accuracies of all the models are summarised below:

| Model | Train accuracy | Test accuracy |
|---|---|---|
| Bigger word-count | 0.9894 | 0.5960 |
| Smaller word-count | 0.9499 | 0.5911 |
| TF-IDF | 0.8704 | 0.6258 |
| Sentiment analysis | 0.4341 | 0.4263 |
| Topic modelling | 0.3500 | 0.3618 |
https://www.kaggle.com/rtatman/tutorial-sentiment-analysis-in-r
https://www.datacamp.com/community/tutorials/sentiment-analysis-R
https://nlp.stanford.edu/pubs/glove.pdf
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
https://keras.rstudio.com/articles/examples/pretrained_word_embeddings.html
https://github.com/rstudio/keras/blob/master/vignettes/examples/pretrained_word_embeddings.R
Landis, J.R.; Koch, G.G. (1977). “The measurement of observer agreement for categorical data”. Biometrics 33 (1): 159-174